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Abstract —Efficient Index structures for fast approximate near- 
est neighbor queries are required in many applications such 
as recommendation systems. In high-dimensional spaces, many 
conventional methods suffer from excessive usage of memory 
and slow response times. We propose a method where multiple 
random projection trees are combined by a novel voting scbeme. 
Tbe key idea is to exploit tbe redundancy in a large number of 
candidate sets obtained by Independently generated random pro- 
jections in order to reduce tbe number of expenslve exact distance 
evaluations. Tbe metbod is straigbtforward to implement using 
sparse projectlons wbicb leads to a reduced memory footprlnt 
and fast Index construction. Furtbermore, it enables grouping of 
tbe required computations into big matrix multipbcations, wbicb 
leads to additional savings due to cacbe effects and low-level 
parallelizatlon. We demonstrate by extensive experiments on a 
wide variety of data sets tbat tbe metbod is faster tban existing 
partitioning tree or basbing based approacbes, making it tbe 
fastest available tecbnique on bigb accuracy levels. 

Index Terms —Nearest Neigbbor Searcb; Random Projectlons; 
Higb Dimensionality; Approximatlon Algoritbms 

I. INTRODUCTION 

Nearest neighbor search is an essential part of many ma- 
chine learning algorithms. Often, it is also the most time- 
consuming stage of the procedure, see 0 - In applications 
areas, such as in recommendation systems, robotics and com¬ 
puter Vision, where fast response times are critical, using brute 
force linear search is often not feasible. This problem is further 
magnified by the availability of inreasingly complex, high- 
dimensional data sources. 

Applications tbat require frequent fc-NN queries from large 
data sets are for example object recognition 0, 0, shape 
recognition 0 and image completion 0 using large databases 
of image descriptors, and content-based web recommendation 
systems 0. 

Consequently, there is a vast body of literature on the 
algorithms for fast nearest neighbor search. These algorithms 
can be divided into exact and approximate nearest neighbor 
search. While exact nearest neighbor search algorithms return 
the Uue nearest neighbors of the query point, they suffer from 
the curse of dimensionality: their performance degrades when 
the dimension of the data increases, rendering them no hetter 
than brute force linear search in the high-dimensional regime. 

In approximate nearest neighbor search, the main interest is 
the tradeoff between query time and accuracy, which can be 


measured either in terms of distance ratios or the probability of 
finding the true nearest neighbors. Several different approaches 
have been proposed and their implementations are commonly 
used in practical applications. 

In this section, we first define the approximate nearest 
neighbor search problem, then review the existing approaches 
to it, and finally outline a new method that avoids the main 
drawbacks of existing methods. 


A. Approximate nearest neighbor search 

Consider a metric space AI, a data set X = (xi,..., x„) C 
AI, a query point q S AI, and a distance metric m : AI^ —>■ K. 
The fc-nearest neighbor (fc-NN) search is the task of finding 
the k closest (w.r.t. m) points to q from the data set X, i.e., 
find a set K C X for which it holds that |K| = fc and 

TO(q,x) < m(q,y) 


for all X e K, y e X \ K. 

In this Work, we consider applications where AI is the d- 
dimensional Euclidean space and m is Euclidean distance 


TO(x,y) 



y^y■ 


An approximate nearest neighbor search algorithm can be 
divided into two separate phases: an offline phase, and an 
Online phase. In the offline phase an index is built, and the 
Online phase refers to completing fast nearest neighbor queries 
using the constructed index. In practical applications, the most 
critical considerations are the accuracy of the approximatlon 
(the definition of the accuracy and the required accuracy level 
depending on the application), and the query time needed to 
reach it. 

As a measure of accuracy, we use recall, which is relevant 
especially for information retrieval and recommendation set- 
tings. It is defined as the proportion of ttue fc-nearest neighbors 
of the query point returned by the algorithm: 


Recall = 


|AnK| 

k 


Here A is a set of k approximate nearest neighbors returned 
by the algorithm, and K is the set of true k nearest neighbors 
of the query point. 





Of secondary importance are the index construction time 
in the offline stage and the memory requirement, which are 
usually not visible to the user but must be feasible. Note that in 
many applications the index must be updated regularly when 
we want to add new data points into the set from where the 
nearest neighbors are searched from. 


B. Related work 

Most effective methods for approximate nearest neighbor 
search in high-dimensional spaces can be classified into either 
hashing, graph, or space-partitioning tree based strategies. 

1) Hashing based algorithms: The most well-known and 
effective hashing based algorithms are variants of locality- 
sensitive hashing (LSH) Q-Ø- In LSH, several hash func- 
tions of the same type are used. These hash functions are 
locality-sensitive, which means that nearby points fall into the 
same hash bucket with higher probability than the points that 
are far away from each other. When answering a fc-NN query, 
the query point is hashed with all the hash functions, and 
then a linear search is performed in the set of data points in 
the buckets where the query point falls into. Multi-probe LSH 


p0)-| 121 improves the efficiency of LSH by searching also the 
nearby hash buckets of the bucket a query point is hashed into. 
Thus a smaller amount of hash tables is required to obtain a 
certain accuracy le vel. The choice of the hash function affects 
the performance and needs to be done carefully in each case. 

2) Graph based algorithms: Approximate graph based 
methods such as IIII-dD build a fc-NN-graph of the data 
set in an offline phase, and then utilize it to do fast fc-NN 
queries. For example in GD several graphs of random subsets 
of the data are used to approximate the exact fc-NN-graph of 
the whole data set. The data set is divided hierarchically and 
randomly into subsets, and the fc-NN-graphs of these subsets 
are built. This process is then repeated several times, and the 
resulting graphs are merged and the final graph is utilized to 
answer nearest neighbor queries. Graph based methods are in 
general efficient, but the index construction is slow for larger 
data sets. 

3) Space-partitioning trees: The first space partitioning- 
tree based strategy proposed for nearest neighbor search was 
the fc-d tree |T§, which divides the data set hierarchically into 
cells that are aligned with the coordinate axes. Nearest neigh¬ 
bors are then searched by performing a backtracking or priority 
search in this tree. fc-d trees and other space-partitioning trees 
can be utilized in approximate nearest neighbor search by 
terminating the search when a predeflned number of data 
points is checked, or all the points within some predeflned 
eiTor bound from the query point are checked Gl)- 

Instead of the hyperplanes, clustering algorithms can be 
utilized to partition the space hierarchically; for example the 
fc-means algorithm is used in fc-means trees ||T|, fT^ . Other 
examples of clustering based partitioning trees are cover trees 
P^ , VP trees | |20) and ball trees f2T). Similarly to graph- 
based algorithms, clustering based variants have the drawback 
of long index construction times due to hierarchical clustering 
on large data sets. 


An efficient scheme to increase the accuracy of space- 
partitioning trees is to utilize several parallel randomized 
space-partitioning trees. Randomized fc-d trees 1^ , are 
grown by choosing a split direction at random from dimen- 
sions of the data in which it has the highest variance. Nearest 
neighbor queries are answered using priority search; a single 
priority queue, which is ordered by distance of the query point 
from splitting point in the splitting direction, is maintained for 
all the trees. 

Another variant of randomized space-partitioning tree is the 
random projection tree (RP tree), which was first proposed for 
vector quantization in p4| , and later modified for approximate 
nearest neighbor search in 1^ . In random projection trees the 
splitting hyperplanes are aligned with the random directions 
sampled from the unit sphere instead of the coordinate axes. 
Nearest neighbor queries are answered using defeatist search: 
first the query point is routed down in several trees, and then 
a brute force linear search is performed in the union of the 
points of all the leaves the query point fell into. 

C. MRPT algorithm 

One of the strengths of randomized space-partitioning trees 
is in their relative simplicity compared to hashing and graph 
based methods; they have only few tunable parameters, and 
their performance is quite robust with respect to them. They 
are also perfectly suitable for parallel implementation because 
the trees are independent, and thus can be easily parallelized 
either locally or over the network with minimal communica- 
tion overhead. 

However, both of the aforementioned variants of random¬ 
ized space-partitioning trees have a few weak points which 
limit their efficiency and scalability. First, randomization used 
in randomized fc-d trees does not make the trees sufficiently 
decorrelated to reach high levels of accuracy efficiently. Ran¬ 
domization used in RP trees is sufficient, but it comes at 
high computational cost: computation of random projections 
is time-consuming for high-dimensional data sets because the 
random vectors have the same dimension as the data. 

Second, both the priority queue search used in p2) , | |2^ 
and the defeatist search used in p5) require checking a large 
proportion of the data points to reach high accuracy levels; 
this leads to slow query times. Third, because each node of 
an RP tree has its own random vector, the number of random 
vectors required is exponential with respect to tree depth. In 
the high-dimensional case this on the one hand increases the 
memory required by the index unnecessarily, and on the other 
hand slows down the index construction process. 

Because of the first two reasons the query times of the 
randomized space-partitioning trees are not competitive with 
the fastest graph based methods (cf. experimental results in 
Section 

In this articie we propose a method that uses multiple ran¬ 
dom projection trees (MRPT); it incorporates two additional 
features that lead to fast query times and accurate results. 
The algorithm has the aforementioned strengths of randomized 
space-partitioning trees but avoids their drawbacks. 



More specifically, our contributions are: 

1) We show that sparse random projections can be used 
to obtain a fast variant of randomized space-partitioning 
trees. 

2) We propose the MRPT algorithm that combines multiple 
trees by voting search as a new and more efficient 
method to utilize randomized space-partitioning trees for 
approximate nearest neighbor search. 

3) We present a time and space complexity analysis of our 
algorithm. 

4) We demonstrate experimentally that the MRPT algo¬ 
rithm with sparse projections and voting search out- 
performs the state-of-the-art methods using several real- 
world data sets across a wide range of sample size and 
dimensionality. 

In the following sections we describe the proposed method 
by first revisiting the classic RP tree method | [25) , and then 
describing the novel features that lead to increased accuracy 
and reduced memory footprint and query time. 


II. Index construction 
A. Classic random projection trees 

At the root node of the tree, the points are projected onto 
a random vector r generated from the d-dimensional standard 
normal distribution Nd{0, 1). The data set is then split into two 
child nodes using the median of the points in the projected 
space: points whose projected values are less or equal to the 
median in the projected space are routed into the left child 
node, and points with projected values greater than the median 
are routed into the right child node. So the points 

x(l),...,x(rV2l) 


fall into the left branch, and the points 

x(rn/2l+l)^ _ ^ X^"^ 


fall into the right branch. Here ..., denotes the 
ordering of the points Xi,..., x„ with respect to the values of 
their projections xfr,... ,x^r. This process is then repeated 
recursively for the subsets of the data in both of the child 
nodes, until a predehned depth t is met. 

For high-dimensional data, the computation of random 
projections is slow, and the memory requirement for storing 
the random vectors is high. To reduce both the computation 
time and the memory footprint, we propose two improvements: 

• Instead of using dense vectors sampled from the d- 
dimensional normal distribution, we use sparse vectors 
to reduce both storage and computation complexity. 

• Instead of using different projection vectors for each 
intermediate node of a tree, we use one projection vector 
for all the nodes on the same level of a tree, so that we 
can further reduce the storage requirement and maximize 
low-level parallelism through vectorization. 


In addition, instead of splitting at a fractile point chosen at 
random from the interval [j, |], as suggested in |25 , we split 


at the median to make the performance more predictable and 
to enable saving the trees in a more compact format. 

In the following subsections we briefly discuss the details 
of the above improvements. Pseudocode for constructing the 
index using sparse RP trees is given in detail in Algorithms[TJ|^ 
below. The proposed Algorithm [^consists of three embedded 
for loops. The outermost loop (line 4 - 15) builds T RP 
trees by continuously calling the GROW_TREE function in 
Algorithm For each individual RP tree, the two inner for 
loops (line 6-12) prepare a sparse random matrix R which 
will be used for projecting the data set X to P using i random 
vectors (at line 13). Specihcally, the innermost loop is for 
constructing a d-dimensional sparse vector for a given tree 
level. In Algorithm the GROW_TREE function constructs a 
binary search tree by recursively splitting the high-dimensional 
space X into sub-spaces using the previous projection P. 


Algorithm 1 Grow T RP trees. 

1 : function GROW_trees(X, T, f, o) 

2: n ^ X.nrows 

3: let trees [1... T] be a new array 

4: for t in 1,..., r do 

5: let R be a new d x i matrix 

6: for level in 1,... do 

7: for i in 1,..., d do 

8: generate z from Bernoulli(a) 

9: if z = 1 then 

10: generate R[i, level] from N{0, 1) 

11: else 

12: R[i, level] ^ 0 

13: P 4- XR 

14: trees[t].root ^ GROW_TREE(X, [1... n], 0, P) 

15: trees[t].random_matrix ^ R 

16: return trees 


Algorithm 2 Grow a single RP tree. 

1 : function grow_tree(X, indices, level, P) 

2: if level = i then 

3: return X[indices, ] as leaf 

4: proj P [indices, level] 

5: split ^ median(proj) 

6: left_indices ^ indices [proj < split] 

7: right indices ^ indices[proj > split] 

8: left ^ GROW_TREE(X, leftjndices, level H- 1, P) 

9: right ^ GROW_TREE(X, rightjndices, level H- 1, P) 

10: return split, left, right 


B. Sparse random projections 

With high-dimensional data sets, computation of the random 
vectors easily becomes a bottleneck on the performance of the 
algorithm. However, it is not necessary to use random vectors 
sampled from the d-dimensional standard normal distribution 
to approximately preserve the pairwise distances between the 











data points. Achlioptas |26| shows that the approximately 

distance-preserving low-dimensional embedding of Johnson- 

Lindenstrauss-lemma is obtained also with sparse random vec- 

tors with components sampled from { —1,0,1} with respective 

|27) prove that the same 

1 _ A_ 

_. ' \/rf ’ 2\/fi ’ 

where d is the dimension of the data, can be used to obtain 

a >/d-fold speed-up without significant loss in accuracy com- 
pared to using normally distributed random vectors. 

We use sparse random vectors r = {ri,...,rd), whose 
components are sampled from the standard normal distribution 
with probability a, and are zeros with probability 1 — a : 


probabilities Li et al. 

components with respective probabilities { , 


n = 


iV(0,l) 

0 


with probability a 
with probability 1 — a. 


The sparsity parameter a can be tuned to optimize perfor¬ 
mance but we have observed that a — ^ recommended in 

_ \/d 

p7) tends to give near-optimal results in all the data sets we 
tested, which suggests that further fine-tuning of this parameter 
is unnecessary. This proportion is small enough to provide 
significant computational savings through the use of sparse 
matrix libraries. 


C. Compactness and speed with fewer vectors 

In classic RP trees, a different random vector is used at each 
inner node of a tree, whereas we use the same random vector 
for all the sibling nodes of a tree. This choice does not affect 
the accuracy at all, because a query point is routed down each 
of the trees only once; hence, the query point is projected onto 
a random vector r i sampled from the same distribution at each 
level of a tree. This means that the query point is projected 
onto i.i.d. random vectors ri,..., in both scenarios. 

An RP tree has 2^ — 1 inner nodes; therefore, if each node 
of a tree had a different random vector as in classic RP trees, 
2^—1 different random vectors would be required for one tree. 
However, when a single vector is used on each level, only 
£ vectors are required. This reduces the amount of memory 
required by the random vectors from exponential to linear with 
respect to the depth of the trees. 

Having only £ random vectors in one tree also speeds up the 
index construction significantly. While some of the observed 
speed-up is explained by a decreased amount of the random 
vectors that have to be generated, mostly it is due to enabling 
the computation of all the projections of the tree in one 
matrix multiplication: the projected data set P G can be 
computed from the data set X G and a random matrix 

R G as 

P = XR. 

Although the total amount of computation stays the same, 
in practice this speeds up the index construction significantly 
due to the cache effects and low-level parallelization through 
vectorization. 


D. Time and space complexity: Index construction 

At each level of each tree the whole data set is projected 
onto a d-dimensional random vector that has on average 
ad non-zero components, so the expectecj^ index construc¬ 
tion time is Q{T£nad). For classic RP trees (a = 1) this 
is Q{T£nd), but for sparse trees (a = this is only 

Q{T£ny/d). 

For each tree, we need to store the points allocated to each 
leaf node, so the memory required by the index is at least 
O (Tn). At each node only a split point is saved; this does not 
increase the space complexity because there are only 2^-1 < 
n inner nodes in one tree. 

The expected amount of memory required by one random 
vector is Q{ad) when random vectors are saved in sparse 
matrix form. This is Q{d) for dense RP trees, and ©(v/d) for 
sparse RP trees with the sparsity parameter hxed to o = ^. 
Because an RP tree has 2^—1 inner nodes, the memory 
requirement for T classic RP trees, which have a different 
random vector for each node of a tree, is O {Tdn). Flowever, 
in our version, in which a single vector is used on each level, 
there are only £ vectors; hence, the memory requirement for 
T sparse RP trees is O (^Tl^s/dlogn + n)^. 

III. Query phase 

In many approximate nearest neighbor search algorithms 
the query phase is further divided into two steps; a candidate 
generation step, and an exact search step. 

In the candidate generation step, a candidate set S, for which 
usually |S'| <C n, is retrieved from the whole data set, and 
then in the exact search step k approximate nearest neighbors 
of a query point are retrieved by performing a brute force 
linear search in the candidate set. In the MRPT algorithm, 
the candidate generation step consists of traversal of T trees 
grown in the index construction phase. 

The leaf to which a query point q belongs to is retrieved 
by hrst projecting q at the root node of the tree onto the same 
random vector as the data points, and then assigning it into the 
left or right branch depending on the value of the projection. If 
it is smaller than or equal to the cutpoint s (median of the data 
points belonging to that node in the projected space) saved at 
that node, i.e. 

< s, 

the query point is routed into the left child node, and other- 
wise into the right child node. This process is then repeated 
recursively until a leaf is met. 

The query point is routed down into a leaf in all the T trees 
obtained in the index construction phase. The query process 
is thus far similar to the one described in | |25| . The principal 
difference is the candidate set generation; in classic RP trees, 
the candidate set S includes all the points that belong to the 
same leaf with the query point in at least one of the trees. A 

* The following complexity results hold exactly, if the algorithm is modified 
so that each random vector has exactly [ad] non-zero components instead of 
the expected number of non-zero components being ad. 



problem with this approach is that when a high number of trees 
are used in the tree traversal step, the size of the candidate set 
jS”! becomes excessively large. 

In the following, we show how the frequency information 
(i.e., how frequently a point falls into the same cell as query 
point) can be utilized to improve both query performance and 
accuracy. 

A. Voting search 

Assume that we have constructed T RP trees of depth l. 
Each of them partitions into 2^ cells (leaves) Li,... L 21 , 
all of which contain or [||J data points. For 1 <t <T, 
let ft be an indicator function of data point x S X and the 
query q, which returns 1, if x and q reside in the same cell 
in tree f, and 0 otherwise; 

/t(x;q) = ^ l{x e Lm,q e Ljn}- 

m—1 

Further, let F be a count function of data point x, which 
returns the number of trees in which x and q belong to the 
same leaf; 

T 

-F(x; q) = ^/t(x;q). 

i=l 

We propose a simple but effective voting search where we 
choose into the candidate set only data points residing in the 
same leaf as the query point in at least v trees: 

5' = {xeX:F(x;q) > u}. 

The vote threshold u is a tuning parameter. A lower threshold 
value yields higher accuracy at the expense of increased query 
times. 

This further pruning of the candidate set utilizes the intuitive 
notion that the doser the data point is to the query point, 
the more probably an RP tree divides the space so that they 
both belong to the same leaf. We emphasize that our voting 
scheme is not restricted to RP trees, but it can be used in 
combination with any form of space-partitioning algorithms, 
given that there is enough randomness involved in the process 
to render the partitions sufficiently independent. 

Pseudocode for the Online stage of the MRPT algorithm 
is given in detail in Algorithms [3]-^ below (the knn(q, k, S) 
function is a regular fc-NN search which returns k nearest 
neighbors for the point q from the set S). 

B. Time and space complexity: Query execution 

When the trees are grown into some predetermined depth 
f using the median split, the expected running time of the 
tree traversal step is <d{T£ad^ because at each level of each 
RP tree the query point is projected onto a d-dimensional 

^Adding the leaf points into the candidate set S for each T trees adds 
a term O into the time complexity of the tree traversal phase for 

defeatist search. Because checking the vote count is a constant time operation 
this term is the same for voting search. However, in both cases these terms 
are dominated by the exact search in the search set, which is an O (T-^d^ 
operation, so they are not taken into account here. 


Algorithm 3 Route a query point into a leaf in an RP tree. 
1: function TREE_QUERY(q, tree) 

2: R ^ tree.random matrix 

3: p ^ q^R 

4: root ^ tree.root 

5: for level in 1, do 

6: if p[level] < root.split then 

7: root ^ root.left 

8 : else 

9: root ^ root.right 

10: return data points in root 


Algorithm 4 Approximate /c-NN search using multiple RP 
trees._ 

1: function APPROXlMATE_KNN(q, k, trees, v) 

2 : 5^0 

3: let votes[l... X.nrows] be a new array 

4: for tree in trees do 

5: for point in TREE_QUERY(q, tree) do 

6: votes [point] ^ votes [point] + 1 

7: if votes [point] = v then 

8: 5 ^ F U {point} 

9: return knn(q, fc, S) 


random vector that has on average ad non-zero components. 
When using random vectors sampled from the d-dimensional 
standard normal distribution, this is Q{T£d), but when using 
value a = ^ for the sparsity parameter, it is only Q{T£s/d). 

When the median split is used, each leaf has either [ 
or [^J data points. Therefore, the size of the final search 
set satisfies jF] < for both defeatist and voting search. 

Increasing the vote threshold v decreases the size of the search 
set S in practice, which explains the significantly reduced 
query times when using voting search (cf. experimental results 
in Section IV 1 . However, because the effect of v on |iS'| 
depends on the distribution of the data, a detailed analysis 
of the time complexity requires further assumptions on the 
data source and is not addressed here. Without any assump¬ 
tions, both defeatist search and voting search have the same 
theoretical upper bound for jF} 

Because of this upper bound for the size of the candidate 
set, the running time of the exact search in the candidate set is 
O (T-^d) for both defeatist search and voting search. Thus, 
the total query time is O (Td{£ + ^)) for classic RP trees, 

and O 


{Td{ 




for the MRPT algorithm. 


IV. Fxperimental results 


We assess the efficiency of our algorithm by comparing 
its against the state-of-the-art approximate nearest neighbor 
search algorithms. To make the comparison as practically 
relevant as possible, we chose widely used methods that are 
available in optimized libraries. 

AU of the compared libraries, including ours, are imple- 
mented in C-n- and compiled with similar optimizations. We 












TABLE I; Time and space complexity of the algorithm com- 
pared to classic RP trees. 



RP trees 

(a=^) 

Query time 

o{Tdil+^)) 


Index construction time 

@{Tind) 

e(T£nVd) 

Index memory 

O (Tdn) 

O ^T{Vdlogn + n)^ 


TABLE II; Data sets used in the experiments 


Data set 

n 

d 

type 

GIST 

1000000 

960 

image descriptors 

SIFT 

2500000 

128 

image descriptors 

MNIST 

60000 

784 

image 

Trevi 

101120 

4096 

image 

STL-10 

100000 

9216 

image 

News 

262144 

1000 

text (TF-IDF + LSA) 

Random 

50000 

4096 

synthetic 


make our comparison and the implementation of our algorithm 
available as open-sourc^ 

All of the experiments were performed on a single computer 
with two Intel Xeon E5540 2.53GHz CPUs and 32GB of 
RAM. No parallelization beyond that achieved by the use of 
linear algebra libraries, such as caching and vectorization of 
matrix operations, was used in any of the experiments. 

A. Data sets 

We made the comparisons using several real-world data 
sets (see Table 0 across a wide range of sample size and 
dimensionality. The data sets used are typical in applications 
in which fast nearest neighbor search is required. Eor all data 
sets, we use a disjoint random sample of 100 test points to 
evaluate the query time and recall. 

The BIGANN and GIST data setØ consist of local 128- 
dimensional SIET descriptors and global 960-dimensional 
color GIST descriptors extracted from images p8) , p9) , 
respectively. As the SIET data set, we use a random sample 
of n = 2500000 data points from the BIGANN data set. 

In addition to feature descriptor data sets, we consider three 
image data sets for which no feature extraction has been 
performed; 1) the MNIST data sej^consists of 28 x 28 pixel 
grayscale images each of which is represented as a vector 
of dimension d = 784; 2) the Trevi data se|^ consists of 
64 X 64 pixel grayscale image patches represented as a vector 
of dimension d = 4096 and the STE-10 data seQ which 
is an image recognition data set containing 96 x 96 pixel 
images from 10 different classes of objects Ø. 

We also use a news data set that contains web pages from 
different news feeds converted into a term frequency-inverse 
document frequency (TE-IDE) representation. Dimensionality 
of the data is reduced to d = 1000 by applying latent semantic 
analysis (LSA) to the TE-IDE data. More elaborate description 
of the preprocessing of the news data set is found in ||^. 

Einally, we include results for a synthetic data set comprised 
of n = 50000 random vectors sampled from the 4096- 
dimensional standard normal distribution and normalized to 
unit length. 

^https://github.com/eiaasaari/mi'pt-comparison 

http://corpus-texmex.irisa.fr/ 

'http://yann.iecun.com/exdb/mnist/ 

^ http://phototour.cs.washmgton.edu/patches/defauit.htm 
' http://cs.stanford.edu/~acoates/sti 10 


B. Compared algorithms 

We compare the MRPT algorithm to representatives from 
all three major groups of approximate nearest neighbor search 
algorithms; tree-, graph- and hashing-based methods. 

To test the efficiency of our proposed improvements to 
classic RP trees, we include our own implementatiorj^ of 
classic RP trees suggested in p5) , and an MRPT-algorithm 
with vote threshold v = 1, which amounts to using sparse RP 
trees without the voting scheme. 

Of the tree-based methods we include randomized fc-d 
trees ||2^ and hierarchical fc-means trees ^23\ . Authors of 0 
suggest that these two algorithms give the best performance 
on data sets with a wide range of dimensionality. Both of the 
algorithms are implemented in ELANISJ^ 

We also use as a benchmark the classic k-d tree algo¬ 
rithm |16| modified for approximate nearest neighbor search 
as described in fp^ ; it is implemented in the ANN librarjj^ 

Among the graph-based methods, the NN-descent algorithm 
suggested in 03 is generally thought to be a leading method. 
It hrst constructs a /c-NN graph using local search in the offline 
phase, and then utilizes this graph for fast fc-NN queries in the 
Online phase. The algorithm is implemented in the KGraph 
librar>{^ 

As a representative of hashing-based Solutions we include a 
multi-probe version of cross-polytope LSH suggested in 112| 
as implemented in the FALCONN librar}p^ All these al¬ 
gorithms and implementations were selected because they 
represent the state-of-the-art in their respective categories. 

For the most important tuning parameters (as stated by the 
authors), we use grid search on the appropriate ranges to 
find the optimal parameter combinations in the experiment, 
or the default values whenever provided by the authors. The 
performance of the MRPT algorithm is controlled by the 
number of trees, the depth of the trees, the sparsity parameter 
a, and the vote threshold. We fix the sparsity parameter as 


^The version of RP trees used in the compaiisons is already a slightly 
improved version: we used RP trees with the same random vector at every 
level and median split (cf. section If we had used a different random 
vectors at each node of the tree we would have ran out of memory on the 
larger data sets. However, we emphasize that using the same random vector 
at every level actually slightly decreases the query times while keeping the 
accuracy intact; hence, the real performance gap between RP trees and the 
MRPT algorithm is actually even wider than shown in the results. 
'http://www.cs.ubc.ca/research/flann/ 
http://www.cs.umd.edu/~mount/ANN/ 

^ ^ http://www.kgraph.org 
http://www.falconn-lib.org 











a = 1/ \/d, and optimize the other parameters using the same 
grid search procedure as for the other methods. For ANN, 
we vary the error bound e, and for FLANN, we control the 
accuracy primarily by tuning the maximum number of leafs 
to search, but also vary the number of trees for k-d trees and 
the branching factor for hierarchical A:-means trees. KGraph is 
primarily controlled by the search cost parameter, but we also 
build the graph with varying number of nearest neighbors for 
each node. For FALCONN, we search for the optimal number 
of hash tables and hash functions. The supplementary material 
on GitHub provides the exact parameter ranges of grid search 
as well as interactive graphs that show the optimal values for 
each obtained result. 

C. Results 

In the experiments we used three different values of k that 
cover the range used in typical applications: k = 1,10 and 
100. However, due to space restrictions, we present results for 
multiple values of k only for MNIST, which is representative 
of the general trend. For the other six data sets we only provide 
results for k = 10. (The results for k = 1 ,100 are available 
in the supplementary material on GitHub). 

Figure [T] shows results for fc = 10 on six data sets. The 
times required to reach a given recall level are shown in Table 
The MRPT algorithm is significantly faster than other tree- 
based methods and LSH on all of the data sets, expect SIFT. 
It is worth noting that SIFT has a much lower dimension (d = 
128) than the other data sets. These results suggest that the 
proposed algorithm is more suitable for high-dimensional data 
sets. 

For the MNIST data set, results for all three values of k (and 
additionally for k = 25) are shown in Figure For the other 
data sets the results for k = 1 and k = 100 follow a similar 
pattern as the results for the MNIST data set; the choice of 
k has no significant effect to the relative performance of the 
other methods, but the relative performance of KGraph seems 
to degrade significantly for small values of k. 

KGraph is the fastest method on low recall levels, but its 
performance degrades rapidly on high recall levels (over 90%), 
thus making the MRPT algorithm the fastest method on very 
high recall levels (over 95%) on all the data sets except SIFT. 
Our hypothesis is that the flexibility obtained on the one hand 
by using completely random split directions, and on the other 
hand by using a high number of trees, which is enabled by 
sparsity and voting, allows the MRPT algorithm to obtain high 
recall levels for high-dimensional data sets very efficiently. 

The performance of the MRPT algorithm is also superior to 
classic RP trees on all of the data sets. In addition, from our 
two main contributions, voting seems the be more important 
than sparsity with respect to query times: sparse RP trees 
are only slightly faster than dense RP trees, the gap being 
somewhat larger on the higher-dimensional data sets, but the 
voting search provides a marked improvement on all data sets. 
This shows that the voting search, especially when combined 
with sparsity, is an efficient way to reduce query time without 
compromising accuracy. 


The numerical values in Table uni indicate that for recall 
levels > 90%, the MRPT method is the fastest in 14 out of 
21 instances, while the KGraph method is fastest in 7 out 21 
cases. Compared to brate force search, MRPT is roughly 25- 
100 times faster on all six real-world data sets at 90% recall 
level, and roughly 10^0 times faster even at the highest 99% 
recall level. 

V. CONCLUSION 

We propose the multiple random projection tree (MRPT) 
algorithm for approximate fc-nearest neighbor search in high 
dimensions. The method is based on combining multiple 
sparse random projection trees using a novel voting scheme 
where the final search is focused to points occurring most 
frequently among points retrieved by multiple trees. The algo¬ 
rithm is straightforward to implement and exploits standard 
fast linear algebra libraries by combining calculations into 
large matrix-matrix operations. 

We demonstrate through extensive experiments on both 
real and simulated data that the proposed method is faster 
than state-of-the-art space-partitioning tree and hashing based 
algorithms on a variety of accuracy levels. It is also faster than 
a leading graph based algorithm (KGraph) on high accuracy 
levels, while being slightly slower on low accuracy levels. The 
good performance of MRPT is especially pronounced for high- 
dimensional data sets. 

Due to its very competitive and consistent performance, 
and simple and efficient index construction stage — especially 
compared to graph-based algorithms — the proposed MRPT 
method is an ideal method for a wide variety of applications 
where high-dimensional large data sets are involved. 
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NN graph (kgraph) and our method (rmpt). We also include results for our implementation of classic RP-trees (rp trees) and 
RP-trees with sparsity (sparse rp trees). 
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TABLE III: Query times required to reach at least a given recall level for recall R = 80%, 90%, 95%, 99%, and for k = 10. 
Fastest times for each recall level are emphasized with bold font. MRPT consistently outperforms other algorithms at high 
recall levels. For example, for R > 95%, 71.4% of times MPRT achieves the fastest query time (i.e., 10 out of 14) whereas it 
is only 28.6% for KGraph (i.e., 4 out of 14). R = 100% by definition for brute force search. 


time (s), 100 queries 


data set 

R(%) 

ANN 

FALCONN 

FLANN-kd 

FLANN-kmeans 

KGraph 

RP-trees 

MRPT(ti = 1) 

MRPT 

brute force 


80 

0.89 

0.21 

0.02 

0.02 

0.02 

0.09 

0.05 

0.02 


MNIST 

90 

1.57 

0.36 

0.04 

0.04 

0.04 

0.16 

0.08 

0.03 

2.59 

95 

2.29 

0.55 

0.06 

0.05 

0.06 

0.2 

0.14 

0.04 


99 

3.46 

1.23 

0.15 

0.1 

0.44 

0.39 

0.22 

0.07 



80 

2.15 

0.39 

0.56 

0.26 

0.05 

0.71 

0.64 

0.12 


News 

90 

4.42 

0.88 

2.92 

0.45 

0.11 

1.37 

1.05 

0.2 

23.09 

95 

7.35 

1.23 

9.55 

1.39 

0.25 

1.98 

1.62 

0.28 


99 

11.83 

2.58 

38.81 

7.5 

3.24 

4.3 

3.87 
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80 

20.54 

21.21 

6.03 

1.86 

0.59 

4.77 

4.19 

1.03 


GIST 

90 

36.18 

29.19 

13.35 

3.63 

1.87 

7.69 

7.54 

1.83 

52.98 

95 

48.13 

37.92 

24.41 

6.22 

7.94 

13.6 

11.89 

2.91 


99 

76.57 

50.29 

59.12 

14.14 

- 

24.11 

20.33 

6.1 



80 

0.81 

0.93 

0.21 

0.09 

0.05 

0.4 

0.38 
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SIFT 
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0.27 


Trevi 
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25.54 
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- 

8.13 

5.25 

2.07 



80 

31.31 
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